-
Notifications
You must be signed in to change notification settings - Fork 144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Plug file descriptor leaks #643
Plug file descriptor leaks #643
Conversation
Thanks for this PR/bug fix. I'll take a look tomorrow and give more feedback, in the mean time I've added a few reviewers. |
1 similar comment
Thanks for this PR/bug fix. I'll take a look tomorrow and give more feedback, in the mean time I've added a few reviewers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for taking this on, @ssgier! This has been a problem from the start. If this indeed solves the problem, it would be an amazing contribution to Lava that makes a lot of lives easier!
I made a few renaming and style suggestions, nothing major.
Do you see a way of writing a unit test for this new functionality? One that ideally fails with the old implementation and passes with the new?
src/lava/magma/runtime/message_infrastructure/close_on_shutdown_smm.py
Outdated
Show resolved
Hide resolved
src/lava/magma/runtime/message_infrastructure/close_on_shutdown_smm.py
Outdated
Show resolved
Hide resolved
src/lava/magma/runtime/message_infrastructure/close_on_shutdown_smm.py
Outdated
Show resolved
Hide resolved
src/lava/magma/runtime/message_infrastructure/close_on_shutdown_smm.py
Outdated
Show resolved
Hide resolved
src/lava/magma/runtime/message_infrastructure/close_on_shutdown_smm.py
Outdated
Show resolved
Hide resolved
src/lava/magma/runtime/message_infrastructure/multiprocessing.py
Outdated
Show resolved
Hide resolved
Thanks for reviewing @mathisrichter! I will implement your suggestions during the coming days and will also think of a way to write some specific tests. Some follow-up informationBelow is a screenshot of my terminal emulator, showing proof of concept:
This means that with low ulimit, only the fix branch avoids the files error, but with high ulimit, both avoid it. This result is deterministic. It can be run any number of times with the same result. I am running:
However: I tested the fix today on my MacBook (macOS Ventura 13.2) and there it does not work. After going back to the Linux machine and taking a deeper look, I noticed that if I call the tests in a specific way (by repeating the same test a large number of times), some descriptors still leak, although much more slowly, eventually leading to the same error. Conclusion: This change fixes some leaks, but not all of them yet. At least one remains. I will try to track them all down and come back with an update soon. |
@ssgier Thanks for that deeper analysis! I brought up your PR internally and learned that there may be tests for this error in some branch. I asked @joyeshmishra to add them as a comment to this thread. |
Update: Found some more leaks and committed a draft of a fix here. New behavior: With these fixes applied, all file descriptor leaks appear to be resolved on both Linux and macOs and the whole test suite runs through with low ulimit. Next steps: I will implement @mathisrichter's suggestions and also write an explicit test. It turns out that there is a simple way of doing this:
Details
I tried to keep the fixes as non-intrusive as possible. One shortcoming is that some of them rely on objects being reclaimed, but it works in practice and from what I can see, it does not break existing functionality. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this contribution. Looks great. Expecting you have verified on all three platforms before merge. Small comment on code otherwise looks great,
@joyeshmishra thanks for reviewing! I did verification on Linux and macOS but not on Windows. Is resource leakage also a problem on Windows? It looked like a problem specific to Unix-like systems. |
As discussed separately with @mathisrichter, removed the now obsolete hint from README and tutorial. The tag referred in the README would have to be updated in the next release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks.
* Plug file descriptor leaks * Plug more leaks * Improve code style * Check higher wait time on CI run * Better fix for semaphore logic * Add unit test for file descriptor leakage * Let threads terminate asynchronously * Disable leakage test on Windows (not applicable) * Add more type hints * Improve naming * Remove hint about too many open files error
Issue Number: #642
Objective of pull request: fix file descriptor leaks.
Pull request checklist
Your PR fulfills the following requirements:
flakeheaven lint src/lava tests/
) and (bandit -r src/lava/.
) pass locallypytest
) passes locallyPull request type
Please check your PR type:
What is the current behavior?
What is the new behavior?
Does this introduce a breaking change?
Supplemental information
Key points all involve addressing sub-optimal usage of the Python multiprocessing library:
Note: the fix works for Linux. For other OSes it would have to be checked as well.